# Load the libraries needed for this analysis
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(plyr)
#library(tidyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange() masks plyr::arrange()
## x purrr::compact() masks plyr::compact()
## x dplyr::count() masks plyr::count()
## x dplyr::failwith() masks plyr::failwith()
## x dplyr::filter() masks stats::filter()
## x dplyr::id() masks plyr::id()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
## x dplyr::mutate() masks plyr::mutate()
## x dplyr::rename() masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggcorrplot)
library(knitr)
library(splines)
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.0-2
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(grid)
library(RColorBrewer)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(gbm)
## Loaded gbm 2.1.8
library(arules) # for Association Rules analysis
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(tree)
## Registered S3 method overwritten by 'tree':
## method from
## print.tree cli
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(arulesViz)
library(DMwR)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'DMwR'
## The following object is masked from 'package:plyr':
##
## join
nb.cols <- 18
mycolors <- colorRampPalette(brewer.pal(8, "BrBG"))(nb.cols)
cookiecol<-c('#ad6a1d','#9a5327','#cc8d4a','#4e1703','#ecc78d','#2e0a05','#d0dfe4','#33575b', '#173742')
The main dataset is the transaction record of a bakery (https://www.kaggle.com/sulmansarwar/transactions-from-a-bakery). Though not specified in the description of the dataset, it is implied that this bakery is located in the Old Town of Edinburgh, UK. The dataset is downloaded from Kaggle and stored in the same directory as this R Markdown file.
To supplement the dataset with more features, we extract historical weather records from Meteostat (https://dev.meteostat.net/python/hourly.html#response-parameters). We use the weather data recorded by a weather station at Edinburgh Airport, which is about 8 miles from the Old Town. This dataset is also stored in the same directory as this R Markdown file.
Combining the bakery transaction record with the historical weather record, there are a few questions we can explore, such as:
Given the date, hour, and weather forecast, can we predict the number of transactions during that hour?
Which items tend to be purchased together?
Do the transactions form natural clusters?
First, we read the bakery transaction record and the hourly weather record for Edinburgh.
df_ori <- read.csv(file = 'BreadBasket_DMS.csv',
header = TRUE,
encoding = 'utf-8')
df_weather_ori <- read.csv(file = 'Edinburgh_weather_hourly.csv',
header = TRUE,
encoding = 'utf-8')
Then we display the first few rows of each dataset.
head(df_ori)
## Date Time Transaction Item
## 1 2016-10-30 09:58:11 1 Bread
## 2 2016-10-30 10:05:34 2 Scandinavian
## 3 2016-10-30 10:05:34 2 Scandinavian
## 4 2016-10-30 10:07:57 3 Hot chocolate
## 5 2016-10-30 10:07:57 3 Jam
## 6 2016-10-30 10:07:57 3 Cookies
head(df_weather_ori)
## time temp dwpt rhum prcp snow wdir wspd wpgt pres tsun coco station
## 1 1/1/2016 0:00 2 0.1 87 NA NA NA NA NA 1013 NA NA 3160
## 2 1/1/2016 1:00 2 0.1 87 NA NA 240 22.3 NA 1014 NA NA 3160
## 3 1/1/2016 2:00 2 0.1 87 NA NA 230 25.9 NA 1015 NA NA 3160
## 4 1/1/2016 3:00 3 -1.0 75 NA NA 230 20.5 NA 1015 NA NA 3160
## 5 1/1/2016 4:00 2 -2.0 75 NA NA 240 9.4 NA 1016 NA NA 3160
## 6 1/1/2016 5:00 2 -2.0 75 NA NA NA 3.6 NA 1016 NA NA 3160
We will create keys in order to merge the two datasets.
df <- df_ori
df_weather <- df_weather_ori
df$Date <- as.Date(df$Date, "%Y-%m-%d")
df$Hour <- as.numeric(substr(df$Time, 1, 2))
df$key <- paste(as.character(df$Date), "@", as.character(df$Hour))
df_weather <- df_weather %>% separate(time, c("Date", "Hour_Minute"), " ")
df_weather <- df_weather %>% separate(Hour_Minute, c("Hour", "Minute"), ":")
df_weather$Date <- as.Date(df_weather$Date, "%m/%d/%Y")
df_weather$Hour <- as.numeric(df_weather$Hour)
df_weather$key <- paste(as.character(df_weather$Date), "@", as.character(df_weather$Hour))
Now we will deal with the data types and missing values, if any. We start with the bakery transaction record.
summary(df)
## Date Time Transaction Item
## Min. :2016-10-30 12:07:39: 16 Min. : 1 Coffee :5471
## 1st Qu.:2016-12-03 10:45:21: 13 1st Qu.:2548 Bread :3325
## Median :2017-01-21 10:55:19: 13 Median :5067 Tea :1435
## Mean :2017-01-17 14:38:01: 13 Mean :4952 Cake :1025
## 3rd Qu.:2017-02-28 13:43:08: 12 3rd Qu.:7329 Pastry : 856
## Max. :2017-04-09 14:19:47: 12 Max. :9684 NONE : 786
## (Other) :21214 (Other):8395
## Hour key
## Min. : 1.00 Length:21293
## 1st Qu.:10.00 Class :character
## Median :12.00 Mode :character
## Mean :12.25
## 3rd Qu.:14.00
## Max. :23.00
##
There are no NAs. Next, we verify whether the dataset contains duplicated rows.
dim(df[duplicated(df), ])[1]
## [1] 1653
There are duplicated rows. Since our analysis will not focus on the quantities of items sold, it is safe to remove them.
df <- distinct(df)
Given the nature of the bakery business, we may expect different behaviors during weekdays and weekends.
df$Weekday <- weekdays(df$Date, abbreviate = TRUE)
df$Weekday <- factor(df$Weekday,
levels = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'))
res <- ddply(df, ~Weekday, summarise, No_of_transaction = length(unique(Transaction)))
ggplot(data = res, mapping = aes(x = Weekday, y = No_of_transaction, fill = Weekday)) +
geom_bar(stat = 'identity',width=0.7,alpha=0.8) +
labs(title = 'No. of Transactions by Weekdays', x = 'Weekdays', y = 'Number of Transactions') +
scale_fill_brewer(palette = "BrBG") +
theme_minimal()
As expected, there are more transactions on Saturday. However, the number of transactions on Sunday seems low. To understand why, we will look at the number of transactions by hour for each weekday.
res <- ddply(df, .(Weekday, Hour), summarise, No_of_transaction = length(unique(Transaction)))
ggplot(data = res, mapping = aes(x = as.factor(Hour), y = No_of_transaction, fill = as.factor(Hour))) +
geom_bar(stat = "identity",alpha=0.8) + facet_wrap(~ Weekday) +
labs(title = 'No. of Transactions by Hour by Weekdays', x = 'Hours', y = 'Number of Transactions') +
scale_fill_manual(values = mycolors) +
theme_minimal() +
theme(legend.position = "none",
axis.text = element_text(size=8),
)
Looking at the distribution of transactions by hour, we notice that the trends for Saturday and Sunday are similar to each other and different from those of the other days. Therefore, it makes sense to group Saturday and Sunday together, even though Sunday has fewer transactions than Saturday. This also implies that we may want to split the dataset into weekday and weekend data for further analysis.
df$Weekend <- mapvalues(df$Weekday,
from = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'),
to = c(0, 0, 0, 0, 0, 1, 1))
From the figure presented in 3.2.2, we noticed that a few transactions are associated with abnormal hours, such as 1 am and 11 pm.
res <- ddply(df, ~Hour, summarise, No_of_transaction = length(unique(Transaction)))
ggplot(data = res, mapping = aes(x = as.factor(Hour), y = No_of_transaction, fill = as.factor(Hour))) +
geom_bar(stat = 'identity',alpha=0.8) +
labs(title = 'No. of Transactions by Hour', x = 'Hours', y = 'Number of Transactions') +
scale_fill_manual(values = mycolors) +
theme_minimal() +
theme(legend.position = "none")
Considering the small number of transactions associated with abnormal hours, we will drop the rows whose hour falls outside normal business operating time. In other words, we will drop rows whose hour is 1, 21, 22, or 23.
df <- df%>%filter(Hour<21&Hour>1)
Another observation is that the number of transactions per hour varies by time of day. Here we split the day into two segments: rush hours from 9 to 15 and non-rush hours for the rest of the day. This split makes sense practically, as the period from 9 to 15 covers breakfast, lunch, and afternoon coffee or tea.
df$Rush_hours <- mapvalues(df$Hour,
from = c(7, 8,
9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20),
to = c(0, 0,
1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0))
df$Rush_hours<-as.factor(df$Rush_hours)
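The mapvalues() call above can equivalently be written as a single logical comparison. A minimal sketch on a standalone toy vector (not the report's data):

```r
# Flag hours 9-15 as rush hours (1), everything else as non-rush (0).
hours <- c(7, 9, 12, 15, 16, 20)
rush  <- factor(as.integer(hours >= 9 & hours <= 15))
as.character(rush)  # "0" "1" "1" "1" "0" "0"
```

This avoids enumerating every hour and stays correct if new hours appear in the data.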
We also want to look at sales during holidays, focusing on Christmas and New Year: 0 = non-holiday, 1 = Christmas/New Year (12/24/2016 - 1/1/2017). Note that the bakery did not open on 12/25/2016, 12/26/2016, or 1/1/2017.
df$Holiday<-0
df$Holiday[df$Date>'2016-12-23'&df$Date<'2017-01-02']<-1
hol<-df%>%group_by(Holiday)%>%summarize(mean_transaction = length(unique(Transaction))/length(unique(Date)))
## `summarise()` ungrouping output (override with `.groups` argument)
hol%>%ggplot(aes(as.factor(Holiday),mean_transaction, fill = as.factor(Holiday),label = round(mean_transaction,2))) +
labs(title = 'Average No. of Transactions by Holiday', x = 'Holiday', y = 'Average Number of Transactions') +
geom_bar(stat = 'identity', width = 0.5,alpha=0.8) +
geom_text(vjust = -0.5) +
scale_fill_manual(values = cookiecol[c(5,7)]) +
theme_minimal()
df$Holiday<-as.factor(df$Holiday)
Indeed, the average number of transactions on holidays is about 12% lower than on non-holidays.
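The percentage drop quoted here can be derived directly from the two group averages in `hol`. A sketch with illustrative numbers (not the report's exact values):

```r
# Percentage drop of the holiday average relative to the non-holiday average.
mean_tx  <- c(non_holiday = 9.0, holiday = 7.9)  # illustrative averages
pct_drop <- 100 * (mean_tx[["non_holiday"]] - mean_tx[["holiday"]]) / mean_tx[["non_holiday"]]
round(pct_drop, 1)  # 12.2
```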
Next, we will focus on the “Item” column to understand what it includes.
Item_tb <- table(df$Item)
sort(Item_tb, decreasing = TRUE)
##
## Coffee Bread
## 4528 3096
## Tea Cake
## 1350 983
## Pastry NONE
## 815 753
## Sandwich Medialuna
## 680 585
## Hot chocolate Cookies
## 551 515
## Brownie Farm House
## 379 371
## Juice Muffin
## 364 364
## Alfajores Scone
## 344 327
## Soup Toast
## 326 318
## Scandinavian Truffles
## 274 192
## Coke Spanish Brunch
## 184 172
## Baguette Tiffin
## 152 146
## Fudge Jam
## 142 142
## Mineral water Jammie Dodgers
## 133 125
## Chicken Stew Hearty & Seasonal
## 123 100
## Salad Frittata
## 99 81
## Smoothies Keeping It Local
## 77 63
## The Nomad Focaccia
## 58 54
## Vegan mincepie Bakewell
## 52 48
## Tartine Afternoon with the baker
## 46 43
## Art Tray Extra Salami or Feta
## 38 38
## Eggs Granola
## 28 28
## Tshirt My-5 Fruit Shoot
## 21 18
## Ella's Kitchen Pouches Crisps
## 17 14
## Dulce de Leche Duck egg
## 13 12
## Kids biscuit Pick and Mix Bowls
## 12 12
## Christmas common Mighty Protein
## 11 11
## Tacos/Fajita Valentine's card
## 11 11
## Postcard Chocolates
## 10 9
## Gingerbread syrup Vegan Feast
## 9 9
## Drinking chocolate spoons Muesli
## 8 8
## Nomad bag Argentina Night
## 8 7
## Coffee granules Empanadas
## 7 7
## Victorian Sponge Basket
## 7 6
## Crepes Half slice Monster
## 6 6
## Honey Lemon and coconut
## 6 6
## Pintxos Bare Popcorn
## 6 5
## Mortimer Panatone
## 5 5
## Bread Pudding Brioche and salami
## 4 3
## Caramel bites Cherry me Dried fruit
## 3 3
## Raspberry shortbread sandwich Bowl Nic Pitt
## 3 2
## Chimichurri Oil Fairy Doors
## 2 2
## Hack the stack Siblings
## 2 2
## Spread Adjustment
## 2 1
## Bacon Chicken sand
## 1 1
## Gift voucher Olum & polenta
## 1 1
## Polenta Raw bars
## 1 1
## The BART
## 1
This list of items seems inconsistent and confusing. For example, there are items named “NONE”. Also, Brownie is separated from Cake, Baguette is not counted as Bread, and Medialuna is treated differently from Pastry.
The item types “Adjustment” and “NONE” were probably introduced by the transaction tracking system or the cashier; in other words, there is no real purchase behind them, so we drop them from the dataset. Then a new ‘Item_Type’ column is created and coded as follows:
Bread = 1
Cookies = 2
Cake|Pastry|Sweets = 3
Coffee = 4
Tea = 5
Hot chocolate|Smoothie|Juice = 6
Other beverage = 7
Meal = 8
Other = 9
*Ambiguous items are coded as ‘Other’
df <- df %>% filter(Item!='NONE' & Item!='Adjustment')
df$Item_Type <- 9
df$Item_Type[df$Item%in%c('Bread', 'Farm House', 'Toast','Baguette','Focaccia')]<-1
df$Item_Type[df$Item == 'Cookies']<-2
df$Item_Type[df$Item%in%c('Cake','Pastry','Medialuna','Brownie','Muffin','Alfajores','Scone','Scandinavian','Truffles','Tiffin','Fudge','Jammie Dodgers','Bakewell','Tartine','Vegan mincepie')]<-3
df$Item_Type[df$Item == 'Coffee']<-4
df$Item_Type[df$Item == 'Tea']<-5
df$Item_Type[df$Item%in%c('Hot chocolate', 'Juice', 'Smoothies')]<-6
df$Item_Type[df$Item%in%c('Mineral water', 'Coke')]<-7
df$Item_Type[df$Item%in%c('Sandwich', 'Soup', 'Spanish Brunch', 'Chicken Stew', 'Salad','Frittata')]<-8
df$Item_Type<-as.factor(df$Item_Type)
df%>%group_by(Item_Type)%>%
summarize(Count = n()) %>%
ggplot(aes(x=Item_Type,y=Count,fill=Item_Type,label = Count)) +
geom_bar(stat="identity", width=0.7,alpha=0.7) +
theme_minimal() +
scale_fill_manual(values = cookiecol,labels = c("1:Bread", "2:Cookies", "3:Cake|Pastry|Sweets",'4:Coffee','5:Tea','6:Hot chocolate|Smoothie|Juice','7:Other beverage','8:Meal','9:Other')) +
geom_text(vjust = -0.5,size=3) +
labs(title = 'Item Frequency in Unique Transactions', x = 'Item Type', y = 'Number of Transactions')
## `summarise()` ungrouping output (override with `.groups` argument)
From the figure above, cakes/pastries/sweets are the most popular items in the bakery, followed by coffee and bread.
We also want to see whether the transactions of each item type vary by hour of the day. We focus on bread, cookies, cakes/pastries/sweets, coffee, and tea.
bread<-df%>%filter(Item_Type==1)%>%
group_by(as.factor(Hour))%>%
mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction))) %>%
ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
geom_bar(stat = 'identity',alpha=0.8) +
facet_wrap(~ Item_Type) +
labs(title = 'Average Number of Transactions of Bread by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)
cookie<-df%>%filter(Item_Type==2)%>%
group_by(as.factor(Hour))%>%
mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction))) %>%
ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
geom_bar(stat = 'identity',alpha=0.8) +
labs(title = 'Average Number of Transactions of Cookies by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)
pastry<-df%>%filter(Item_Type==3)%>%
group_by(as.factor(Hour))%>%
mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction))) %>%
ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
geom_bar(stat = 'identity',alpha=0.8) +
labs(title = 'Average Number of Transactions of Cakes/Pastries/Sweets by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)
coffee<-df%>%filter(Item_Type==4)%>%
group_by(as.factor(Hour))%>%
mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction))) %>%
ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
geom_bar(stat = 'identity',alpha=0.8) +
labs(title = 'Average Number of Transactions of Coffee by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)
tea<-df%>%filter(Item_Type==5)%>%
group_by(as.factor(Hour))%>%
mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(df$Transaction))) %>%
ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
geom_bar(stat = 'identity',alpha=0.8) +
labs(title = 'Average Number of Transactions of Tea by Hour', x = 'Hours', y = 'Avg No. of Transaction') +
scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)
grid.arrange(bread,cookie,pastry,coffee,tea, nrow=3,ncol=2,
top = textGrob("Average Transaction Frequency by Hours: Bread,Cookies,Pastries,Coffee, and Tea",gp=gpar(fontsize=14,font=3)))
The figure above shows that, on average, bread and coffee tend to sell more in the morning (around 11 am), while tea tends to sell more in the afternoon. Transaction frequencies of cookies and cakes/pastries/sweets peak in both the morning and the afternoon.
Let’s start with a high-level summary of the weather dataset.
summary(df_weather)
## Date Hour Minute temp
## Min. :2016-01-01 Min. : 0.00 Length:17544 Min. :-8.000
## 1st Qu.:2016-07-01 1st Qu.: 5.00 Class :character 1st Qu.: 5.000
## Median :2016-12-31 Median :11.00 Mode :character Median :10.000
## Mean :2016-12-31 Mean :11.50 Mean : 9.165
## 3rd Qu.:2017-07-02 3rd Qu.:17.25 3rd Qu.:13.000
## Max. :2018-01-01 Max. :23.00 Max. :28.000
##
## dwpt rhum prcp snow
## Min. :-8.900 Min. : 29.00 Mode:logical Mode:logical
## 1st Qu.: 2.100 1st Qu.: 76.00 NA's:17544 NA's:17544
## Median : 6.900 Median : 86.00
## Mean : 6.165 Mean : 82.56
## 3rd Qu.:10.000 3rd Qu.: 93.00
## Max. :18.000 Max. :100.00
##
## wdir wspd wpgt pres tsun
## Min. : 10.0 Min. : 0.00 Mode:logical Min. : 965 Mode:logical
## 1st Qu.:130.0 1st Qu.: 7.60 NA's:17544 1st Qu.:1006 NA's:17544
## Median :240.0 Median :13.00 Median :1014
## Mean :200.5 Mean :15.04 Mean :1013
## 3rd Qu.:250.0 3rd Qu.:20.50 3rd Qu.:1020
## Max. :360.0 Max. :66.60 Max. :1040
## NA's :1140 NA's :28 NA's :824
## coco station key
## Min. : 5.000 Min. :3160 Length:17544
## 1st Qu.: 7.000 1st Qu.:3160 Class :character
## Median : 7.000 Median :3160 Mode :character
## Mean : 7.139 Mean :3160
## 3rd Qu.: 7.000 3rd Qu.:3160
## Max. :25.000 Max. :3160
## NA's :14714
Some columns contain NA values. The columns “prcp”, “snow”, “wpgt”, and “tsun” contain only NAs, so we will drop them.
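Rather than spotting them by eye, the all-NA columns can be identified programmatically. A sketch on a toy data frame (in the report this would be applied to df_weather):

```r
# Columns where the count of NAs equals the number of rows are entirely NA.
toy <- data.frame(temp = c(2, 3), snow = c(NA, NA), wspd = c(22.3, NA))
all_na_cols <- names(toy)[colSums(is.na(toy)) == nrow(toy)]
all_na_cols  # "snow"
```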
The Meteostat website gives a clear explanation of each column (https://dev.meteostat.net/python/hourly.html#response-parameters):
station: The Meteo ID of the weather station
temp: The air temperature in C
dwpt: The dew point in C
rhum: The relative humidity in percent
wdir: The average wind direction in degrees
wspd: The average wind speed in km/h
pres: The average sea-level air pressure in hPa
coco: The weather condition code (https://dev.meteostat.net/docs/formats.html#weather-condition-codes). Only significant weather events are reported here.
Based on the description of each column, we can drop the “station” column, as it is the same for all observations. We can also drop the “dwpt” column, since “temp” is already among our features, and the “pres” column, as air pressure is difficult to interpret without a background in meteorology. For the missing values in “coco”, we will fill in 0, which stands for no significant weather event.
We will also remove the “Minute” column as it is always 00.
df_weather <- subset(df_weather,
select = -c(Minute, prcp, snow, wpgt, tsun, station, dwpt, pres))
For the missing values in “wdir” and “wspd”, given the persistence of weather conditions, we fill each gap with the value from the following hour.
df_weather$coco[is.na(df_weather$coco)] <- 0
df_weather <- df_weather %>% tidyr::fill(wdir, .direction = "up")
df_weather <- df_weather %>% tidyr::fill(wspd, .direction = "up")
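The .direction = "up" argument makes each NA take the next observed value below it, i.e. the reading from the following hour. A base-R sketch of the same idea:

```r
# Walk from the bottom up, copying the next non-NA value into each NA slot.
fill_up <- function(x) {
  for (i in rev(seq_len(length(x) - 1))) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}
fill_up(c(NA, 10, NA, 12))  # 10 10 12 12
```

Note that a trailing NA (with no later value to borrow from) would remain NA, which matches tidyr::fill()'s behaviour.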
We now merge the two datasets on the pre-defined keys.
df_merged <- merge(x = df, y = df_weather, by = 'key', all.x = TRUE)
df_merged$Date <- df_merged$Date.x
df_merged$Hour <- df_merged$Hour.x
df_merged <- subset(df_merged,
select = -c(key, Time, Date.y, Hour.y, Date.x, Hour.x))
df_merged$coco<-as.factor(df_merged$coco)
df_merged_feature <- subset(df_merged, select = -c(Transaction, Item, Weekday,Weekend, Item_Type, Date,Holiday,coco,Rush_hours))
#Calculate correlation here
corr <- round(cor(df_merged_feature), digits = 2)
#Use ggcorrplot to graph correlation. Only plot the lower triangle of the correlation matrix.
ggcorrplot(corr, type = "lower",
ggtheme = ggplot2::theme_minimal,
lab = TRUE,
colors = c(cookiecol[5],'white',cookiecol[7]))
The correlation plot suggests possible correlations between temperature and humidity, and between temperature and wind speed.
# To build the dataset for the regression problem
# Keep Hour as a feature
df_merged_reg<-df_merged%>%
group_by(Date, Hour)%>%
mutate(No_of_transaction = length(unique(Transaction)))%>%
ungroup()%>%
#select(-c(Transaction, Item, Item_Type,Date,Hour))%>%
select(-c(Transaction, Item, Item_Type,Date))%>%
unique()%>%
data.frame()
# Keep Weekday as a feature
df_weekday<-df_merged_reg%>%
filter(Weekend==0)%>%
#select(-Weekday,-Weekend)%>%
select(-Weekend)%>%
data.frame()
summary(df_weekday)
## Weekday Rush_hours Holiday temp rhum wdir
## Mon:201 0:308 0:1065 Min. :-6.00 Min. : 47.00 Min. : 10.0
## Tue:225 1:790 1: 33 1st Qu.: 5.00 1st Qu.: 71.25 1st Qu.:200.0
## Wed:222 Median : 7.00 Median : 81.00 Median :240.0
## Thu:228 Mean : 6.85 Mean : 80.01 Mean :212.5
## Fri:222 3rd Qu.: 9.00 3rd Qu.: 87.00 3rd Qu.:260.0
## Sat: 0 Max. :15.00 Max. :100.00 Max. :360.0
## Sun: 0
## wspd coco Hour No_of_transaction
## Min. : 0.00 0 :978 Min. : 7.00 Min. : 1.000
## 1st Qu.: 9.40 5 : 20 1st Qu.:10.00 1st Qu.: 3.000
## Median :14.80 7 : 84 Median :12.00 Median : 5.000
## Mean :16.84 8 : 2 Mean :12.42 Mean : 5.593
## 3rd Qu.:24.10 12: 5 3rd Qu.:15.00 3rd Qu.: 8.000
## Max. :55.40 17: 9 Max. :20.00 Max. :17.000
##
# Split out the weekend data, keeping Weekday as a feature
df_weekend<-df_merged_reg%>%
filter(Weekend==1)%>%
#select(-Weekday,-Weekend)%>%
select(-Weekend)%>%
data.frame()
summary(df_weekend)
## Weekday Rush_hours Holiday temp rhum wdir
## Mon: 0 0: 83 0:378 Min. :-2.000 Min. : 31.00 Min. : 10.0
## Tue: 0 1:313 1: 18 1st Qu.: 4.000 1st Qu.: 76.00 1st Qu.:220.0
## Wed: 0 Median : 7.000 Median : 82.00 Median :240.0
## Thu: 0 Mean : 7.452 Mean : 81.41 Mean :218.7
## Fri: 0 3rd Qu.:10.000 3rd Qu.: 93.00 3rd Qu.:260.0
## Sat:226 Max. :16.000 Max. :100.00 Max. :360.0
## Sun:170
## wspd coco Hour No_of_transaction
## Min. : 0.00 0 :333 Min. : 7.00 Min. : 1.000
## 1st Qu.: 9.40 5 : 10 1st Qu.:10.00 1st Qu.: 5.000
## Median :14.80 7 : 48 Median :12.00 Median : 8.000
## Mean :16.44 8 : 0 Mean :12.44 Mean : 8.359
## 3rd Qu.:22.30 12: 1 3rd Qu.:15.00 3rd Qu.:11.000
## Max. :48.20 17: 4 Max. :20.00 Max. :25.000
##
Next, we plot scatter plots of the number of transactions against each weather feature.
# scatter plots of number of transaction against different weather features, respectively
temp<-df_merged_reg%>%ggplot(aes(temp,No_of_transaction)) +
geom_jitter(alpha = 0.7, size=0.7, colour = cookiecol[3]) +
labs(x='Temperature in Celsius', y='Number of Transaction') +
theme_minimal()
ws<-df_merged_reg%>%ggplot(aes(wspd,No_of_transaction)) +
geom_jitter(alpha = 0.7, size=0.7, colour = cookiecol[3]) +
labs(x='Wind Speed', y='Number of Transaction') +
theme_minimal()
hum<-df_merged_reg%>%ggplot(aes(rhum,No_of_transaction)) +
geom_jitter(alpha = 0.7, size=0.7, colour = cookiecol[3]) +
labs(x='Humidity', y='Number of Transaction') +
theme_minimal()
wcc<-df_merged_reg%>%ggplot(aes(coco,No_of_transaction)) +
geom_jitter(alpha = 0.7, size=0.7, colour = cookiecol[3]) +
labs(x='Weather Condition Code', y='Number of Transaction') +
theme_minimal()
grid.arrange(temp,ws,hum,wcc, nrow = 2)
Here we will construct the dataset to perform market basket analysis at a later stage.
df_merged_mba <- df_merged
# Map the item type codes back to readable names. Note: we index with
# df_merged_mba itself; indexing with df would misalign rows, since the
# merge reorders the data.
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 1] <- 'Bread'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 2] <- 'Cookies'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 3] <- 'Cake|Pastry|Sweets'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 4] <- 'Tea'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 4] <- 'Coffee'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 5] <- 'Tea'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 6] <- 'Hot chocolate|Smoothie|Juice'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 7] <- 'Other beverage'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 8] <- 'Meal'
df_merged_mba$Item_Type_Name[df_merged_mba$Item_Type == 9] <- 'Other'
df_merged_mba_item_type_temp <- subset(df_merged_mba,
select = c(Transaction, Item_Type_Name, Weekend))
df_merged_mba_item_type_temp <- distinct(df_merged_mba_item_type_temp)
# Remove the transactions including only one item type
df_merged_mba_item_type_temp <-
df_merged_mba_item_type_temp %>%
group_by(Transaction) %>%
mutate(freq = n()) %>%
data.frame()
df_merged_mba_item_type_temp <- df_merged_mba_item_type_temp[df_merged_mba_item_type_temp$freq > 1, ]
# MBA on category of items
df_merged_mba_item_type <- df_merged_mba_item_type_temp %>%
group_by(Transaction) %>%
summarise(basket = as.vector(list(Item_Type_Name)))
## `summarise()` ungrouping output (override with `.groups` argument)
# MBA on items
df_merged_mba_item <- df_merged_mba %>%
group_by(Transaction) %>%
mutate(basket = as.vector(list(Item)))%>%ungroup()%>%data.frame()
mba_wkday<-df_merged_mba_item%>%
filter(Weekend==0)%>%
select(-Weekend)
mba_wkend<-df_merged_mba_item%>%
filter(Weekend==1)%>%
select(-Weekend)
Here we will construct the dataset for the clustering analysis.
df_merged_cluster<-df_merged%>%
#group_by(Transaction)%>%
#mutate(No_Item_Type = length(unique(Item)))%>%
#ungroup()%>%
select(-Transaction,-Item, -Weekday,-Date)%>%
unique()%>%
data.frame()
df_merged_cluster[,c(2:10)]<-sapply(df_merged_cluster[,c(2:10)], as.numeric)
cluster_wkday<-df_merged_cluster%>%filter(Weekend==0)%>%select(-Weekend)%>%data.frame()
cluster_wkend<-df_merged_cluster%>%filter(Weekend==1)%>%select(-Weekend)%>%data.frame()
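One caveat with the sapply(..., as.numeric) conversion above: applied to a factor, as.numeric() returns the underlying level codes rather than the original labels, which is why Rush_hours and Holiday appear as 1/2 instead of 0/1 in the summaries. A sketch:

```r
# as.numeric() on a factor yields level codes; go through as.character()
# to recover the original values.
f <- factor(c(0, 1, 1))
as.numeric(f)                # 1 2 2 (level codes)
as.numeric(as.character(f))  # 0 1 1 (original values)
```

Since the shift from 0/1 to 1/2 preserves distances between observations, it is harmless for clustering, but it is worth keeping in mind when interpreting the summaries and cluster centers.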
summary(cluster_wkday)
## Rush_hours Holiday Item_Type temp
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :-6.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 5.000
## Median :2.000 Median :1.000 Median :4.000 Median : 7.000
## Mean :1.805 Mean :1.029 Mean :4.196 Mean : 7.063
## 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:6.000 3rd Qu.:10.000
## Max. :2.000 Max. :2.000 Max. :9.000 Max. :15.000
## rhum wdir wspd coco
## Min. : 47.00 Min. : 10.0 Min. : 0.00 Min. :1.000
## 1st Qu.: 71.00 1st Qu.:200.0 1st Qu.: 9.40 1st Qu.:1.000
## Median : 81.00 Median :240.0 Median :14.80 Median :1.000
## Mean : 79.24 Mean :212.6 Mean :17.02 Mean :1.232
## 3rd Qu.: 87.00 3rd Qu.:260.0 3rd Qu.:24.10 3rd Qu.:1.000
## Max. :100.00 Max. :360.0 Max. :55.40 Max. :6.000
## Hour
## Min. : 7.0
## 1st Qu.:10.0
## Median :13.0
## Mean :12.5
## 3rd Qu.:15.0
## Max. :20.0
summary(cluster_wkend)
## Rush_hours Holiday Item_Type temp
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :-2.00
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 5.00
## Median :2.000 Median :1.000 Median :4.000 Median : 8.00
## Mean :1.855 Mean :1.042 Mean :4.529 Mean : 7.63
## 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:6.000 3rd Qu.:10.00
## Max. :2.000 Max. :2.000 Max. :9.000 Max. :16.00
## rhum wdir wspd coco
## Min. : 31.00 Min. : 10.0 Min. : 0.00 Min. :1.000
## 1st Qu.: 76.00 1st Qu.:220.0 1st Qu.: 9.40 1st Qu.:1.000
## Median : 82.00 Median :240.0 Median :14.80 Median :1.000
## Mean : 80.86 Mean :216.1 Mean :16.35 Mean :1.328
## 3rd Qu.: 93.00 3rd Qu.:260.0 3rd Qu.:22.30 3rd Qu.:1.000
## Max. :100.00 Max. :360.0 Max. :48.20 Max. :6.000
## Hour
## Min. : 7.00
## 1st Qu.:11.00
## Median :12.00
## Mean :12.47
## 3rd Qu.:14.00
## Max. :20.00
The task of this analysis is to predict the number of transactions during a specific hour, given the date, hour, and weather forecast. More importantly, from the models created, we expect to better understand how sales are driven by external variables such as weather, day of the week, and time of day.
First, we split the dataset into a training set and a validation set.
set.seed(0)
train_wkday<-sample(1:nrow(df_weekday), round(0.8*dim(df_weekday)[1]))
train_wkend <- sample(1:nrow(df_weekend), round(0.8*dim(df_weekend)[1]))
num_features <- dim(df_weekday)[2]
We will start with a lasso regression model. Weekday
x_wkday <- model.matrix(No_of_transaction~.,df_weekday)[,-1]
y_wkday <- df_weekday$No_of_transaction
y.test_wkday <- y_wkday[-train_wkday]
grid=10^seq(10,-2, length =100)
lasso.mod=glmnet(x_wkday[train_wkday,],y_wkday[train_wkday],alpha=1,lambda=grid)
plot(lasso.mod)
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values
cv.out=cv.glmnet(x_wkday[train_wkday,],y_wkday[train_wkday],alpha=1)
plot(cv.out)
Then we run the lasso model on the test set to see the RMSE.
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x_wkday[-train_wkday,])
#RMSE
lasso_RMSE_wkday <- sqrt(mean((lasso.pred-y.test_wkday)^2))
lasso_RMSE_wkday
## [1] 2.880674
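The RMSE reported throughout is the root mean squared prediction error. As a tiny helper function (a sketch, equivalent to the inline computation above):

```r
# Root mean squared error between predictions and observed values.
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
rmse(c(3, 5), c(1, 5))  # sqrt((4 + 0) / 2) = sqrt(2), about 1.414
```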
out <- glmnet(x_wkday,y_wkday,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:8,]
lasso.coef
## (Intercept) WeekdayTue WeekdayWed WeekdayThu WeekdayFri WeekdaySat
## 3.65314614 -0.08972362 -0.38786168 0.00000000 1.04048780 0.00000000
## WeekdaySun Rush_hours1
## 0.00000000 3.64261787
lasso.coef[lasso.coef!=0]
## (Intercept) WeekdayTue WeekdayWed WeekdayFri Rush_hours1
## 3.65314614 -0.08972362 -0.38786168 1.04048780 3.64261787
Weekend
x_wkend <- model.matrix(No_of_transaction~.,df_weekend)[,-1]
y_wkend <- df_weekend$No_of_transaction
y.test_wkend <- y_wkend[-train_wkend]
grid=10^seq(10,-2, length =100)
lasso.mod=glmnet(x_wkend[train_wkend,],y_wkend[train_wkend],alpha=1,lambda=grid)
plot(lasso.mod)
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values
cv.out=cv.glmnet(x_wkend[train_wkend,],y_wkend[train_wkend],alpha=1)
plot(cv.out)
The test set RMSE of the model chosen by lasso is
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x_wkend[-train_wkend,])
#RMSE
lasso_RMSE_wkend <- sqrt(mean((lasso.pred-y.test_wkend)^2))
lasso_RMSE_wkend
## [1] 4.063711
out <- glmnet(x_wkend,y_wkend,alpha=1,lambda=grid)
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:8,]
lasso.coef
## (Intercept) WeekdayTue WeekdayWed WeekdayThu WeekdayFri WeekdaySat
## 3.491346 0.000000 0.000000 0.000000 0.000000 2.214838
## WeekdaySun Rush_hours1
## 0.000000 5.117108
lasso.coef[lasso.coef!=0]
## (Intercept) WeekdaySat Rush_hours1
## 3.491346 2.214838 5.117108
Random Forest
Secondly, we try another model, a random forest regression. We will try several different sets of hyperparameters, namely ‘mtry’ and ‘ntree’, to find the best random forest regression model as a baseline.
Weekday
search_grid <- expand.grid(mtry = c(round(num_features/3/2),
round(num_features/3),
round(num_features/3*2)),
ntree = c(100, 500, 1000))
min_RMSE <- Inf
best_mtry <- 0
best_ntree <- 0
for (i in seq(1, nrow(search_grid))){
rf_reg_wkday <- randomForest(No_of_transaction~.,
df_weekday[train_wkday, ],
mtry = search_grid[i, ]$mtry,
ntree = search_grid[i, ]$ntree,
importance = TRUE
)
# rf$mse holds the out-of-bag MSE after each tree; we use its mean as the
# selection criterion (note this is MSE, not RMSE, despite the name).
if (mean(rf_reg_wkday$mse) < min_RMSE) {
min_RMSE <- mean(rf_reg_wkday$mse)
best_mtry <- search_grid[i, ]$mtry
best_ntree <- search_grid[i, ]$ntree
}
}
# Train the random forest model using the best 'mtry' and 'ntree'.
rf_reg_best_wkday <- randomForest(No_of_transaction~.,
df_weekday[train_wkday, ],
mtry = best_mtry,
ntree = best_ntree,
importance = TRUE
)
rf_reg_best_wkday
##
## Call:
## randomForest(formula = No_of_transaction ~ ., data = df_weekday[train_wkday, ], mtry = best_mtry, ntree = best_ntree, importance = TRUE)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 7.123584
## % Var explained: 33.56
Then we train a boosted regression trees model to compare with the random forest model constructed above.
Weekday
search_grid <- expand.grid(shrinkage = c(0.01, 0.001),
interaction.depth = c(4, 5),
n.minobsinnode = c(10, 100),
bag.fraction = c(0.5, 0.8),
optimal_trees = 0,
min_RMSE = 0)
for (i in seq(1, nrow(search_grid))){
boost.fit_wkday <- gbm(No_of_transaction~. ,
data = df_weekday[train_wkday, ],
distribution = "gaussian",
cv.folds = 10,
n.trees = 4000,
train.fraction = 0.75,
n.cores = NULL,
verbose = FALSE,
shrinkage = search_grid[i, ]$shrinkage,
interaction.depth = search_grid[i, ]$interaction.depth,
n.minobsinnode = search_grid[i, ]$n.minobsinnode,
bag.fraction = search_grid[i, ]$bag.fraction
)
search_grid[i, ]$min_RMSE <- min(boost.fit_wkday$cv.error)
search_grid[i, ]$optimal_trees <- which.min(boost.fit_wkday$cv.error)
}
# Train and fit this model with the best sets of hyperparameters:
boost_best_para_wkday <-
search_grid[search_grid$min_RMSE == min(search_grid$min_RMSE),]
boost.best_wkday <- gbm(No_of_transaction~. ,
data = df_weekday[train_wkday, ],
distribution = "gaussian",
cv.folds = 10,
n.trees = 4000,
train.fraction = 0.75,
n.cores = NULL,
verbose = FALSE,
shrinkage = boost_best_para_wkday$shrinkage,
interaction.depth = boost_best_para_wkday$interaction.depth,
n.minobsinnode = boost_best_para_wkday$n.minobsinnode,
bag.fraction = boost_best_para_wkday$bag.fraction
)
boost.best_wkday
## gbm(formula = No_of_transaction ~ ., distribution = "gaussian",
## data = df_weekday[train_wkday, ], n.trees = 4000, interaction.depth = boost_best_para_wkday$interaction.depth,
## n.minobsinnode = boost_best_para_wkday$n.minobsinnode, shrinkage = boost_best_para_wkday$shrinkage,
## bag.fraction = boost_best_para_wkday$bag.fraction, train.fraction = 0.75,
## cv.folds = 10, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 4000 iterations were performed.
## The best cross-validation iteration was 335.
## The best test-set iteration was 1078.
## There were 9 predictors of which 9 had non-zero influence.
Then we compare the performance of both the random forest model and the trees model using boosting algorithm on the Weekday validation set.
# Note: we predict with all 4000 trees here; predicting with the CV-optimal
# iteration (gbm.perf(boost.best_wkday, method = "cv")) might generalize
# slightly better.
boost_pred_test_wkday <- predict(boost.best_wkday,
newdata = df_weekday[-train_wkday, ],
n.trees = 4000)
boost_RMSE_test_wkday <- caret::RMSE(boost_pred_test_wkday, df_weekday[-train_wkday, ]$No_of_transaction)
rf_pred_test_wkday <- predict(rf_reg_best_wkday,
newdata = df_weekday[-train_wkday, ])
rf_RMSE_test_wkday <- caret::RMSE(rf_pred_test_wkday, df_weekday[-train_wkday, ]$No_of_transaction)
print(paste0('RMSE of the boosting model on Weekday testing set:', round(boost_RMSE_test_wkday, 4)))
## [1] "RMSE of the boosting model on Weekday testing set:2.9753"
print(paste0('RMSE of the random forest model on Weekday testing set:', round(rf_RMSE_test_wkday, 4)))
## [1] "RMSE of the random forest model on Weekday testing set:2.8462"
print(paste0('RMSE of the lasso regression model on Weekday testing set:', round(lasso_RMSE_wkday, 4)))
## [1] "RMSE of the lasso regression model on Weekday testing set:2.8807"
For the weekday data, it turns out that the random forest model performs best on the test set in terms of RMSE. So we will choose the random forest model to explain how different features impact the number of transactions in a given hour of a weekday. For the random forest model, the RMSE on the training set is 2.6818, which does not suggest a risk of overfitting. One thing worth noting, however, is that our random forest explains only about 34% of the variance. This is probably due to the very limited number of features in our original dataset.
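The training-set RMSE quoted above can be reproduced with a short check like the following (a sketch assuming the `rf_reg_best_wkday`, `df_weekday`, and `train_wkday` objects defined earlier):

```r
# In-sample fit: predict explicitly on the training rows.
# (Calling predict() without newdata would instead return
# the out-of-bag predictions of the random forest.)
train_pred_wkday <- predict(rf_reg_best_wkday,
                            newdata = df_weekday[train_wkday, ])
sqrt(mean((train_pred_wkday - df_weekday[train_wkday, ]$No_of_transaction)^2))
```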
Weekend
search_grid <- expand.grid(mtry = c(round(num_features/3/2),
round(num_features/3),
round(num_features/3*2)),
ntree = c(100, 500, 1000))
min_RMSE <- Inf
best_mtry <- 0
best_ntree <- 0
for (i in seq(1, nrow(search_grid))){
rf_reg_wkend <- randomForest(No_of_transaction~.,
df_weekend[train_wkend, ],
mtry = search_grid[i, ]$mtry,
ntree = search_grid[i, ]$ntree,
importance = TRUE
)
if (mean(rf_reg_wkend$mse) < min_RMSE) {
min_RMSE <- mean(rf_reg_wkend$mse)
best_mtry <- search_grid[i, ]$mtry
best_ntree <- search_grid[i, ]$ntree
}
}
# Train the random forest model using the best 'mtry' and 'ntree'.
rf_reg_best_wkend <- randomForest(No_of_transaction~.,
df_weekend[train_wkend, ],
mtry = best_mtry,
ntree = best_ntree,
importance = TRUE
)
rf_reg_best_wkend
##
## Call:
## randomForest(formula = No_of_transaction ~ ., data = df_weekend[train_wkend, ], mtry = best_mtry, ntree = best_ntree, importance = TRUE)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 12.72002
## % Var explained: 41.38
search_grid <- expand.grid(shrinkage = c(0.01, 0.001),
interaction.depth = c(4, 5),
n.minobsinnode = c(10, 20),
bag.fraction = c(0.5, 0.8),
optimal_trees = 0,
min_RMSE = 0)
for (i in seq(1, nrow(search_grid))){
boost.fit_wkend <- gbm(No_of_transaction~. ,
data = df_weekend[train_wkend, ],
distribution = "gaussian",
cv.folds = 10,
n.trees = 4000,
train.fraction = 0.75,
n.cores = NULL,
verbose = FALSE,
shrinkage = search_grid[i, ]$shrinkage,
interaction.depth = search_grid[i, ]$interaction.depth,
n.minobsinnode = search_grid[i, ]$n.minobsinnode,
bag.fraction = search_grid[i, ]$bag.fraction
)
search_grid[i, ]$min_RMSE <- min(boost.fit_wkend$cv.error)
search_grid[i, ]$optimal_trees <- which.min(boost.fit_wkend$cv.error)
}
# Train and fit this model with the best sets of hyperparameters:
boost_best_para_wkend <-
search_grid[search_grid$min_RMSE == min(search_grid$min_RMSE),]
boost.best_wkend <- gbm(No_of_transaction~. ,
data = df_weekend[train_wkend, ],
distribution = "gaussian",
cv.folds = 10,
n.trees = 4000,
train.fraction = 0.75,
n.cores = NULL,
verbose = FALSE,
shrinkage = boost_best_para_wkend$shrinkage,
interaction.depth = boost_best_para_wkend$interaction.depth,
n.minobsinnode = boost_best_para_wkend$n.minobsinnode,
bag.fraction = boost_best_para_wkend$bag.fraction
)
boost.best_wkend
## gbm(formula = No_of_transaction ~ ., distribution = "gaussian",
## data = df_weekend[train_wkend, ], n.trees = 4000, interaction.depth = boost_best_para_wkend$interaction.depth,
## n.minobsinnode = boost_best_para_wkend$n.minobsinnode, shrinkage = boost_best_para_wkend$shrinkage,
## bag.fraction = boost_best_para_wkend$bag.fraction, train.fraction = 0.75,
## cv.folds = 10, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 4000 iterations were performed.
## The best cross-validation iteration was 432.
## The best test-set iteration was 265.
## There were 9 predictors of which 8 had non-zero influence.
boost_pred_test_wkend <- predict(boost.best_wkend,
newdata = df_weekend[-train_wkend, ],
n.trees = 4000)
boost_RMSE_test_wkend <- caret::RMSE(boost_pred_test_wkend, df_weekend[-train_wkend, ]$No_of_transaction)
rf_pred_test_wkend <- predict(rf_reg_best_wkend,
newdata = df_weekend[-train_wkend, ])
rf_RMSE_test_wkend <- caret::RMSE(rf_pred_test_wkend, df_weekend[-train_wkend, ]$No_of_transaction)
print(paste0('RMSE of the boosting model on Weekend testing set:', round(boost_RMSE_test_wkend, 4)))
## [1] "RMSE of the boosting model on Weekend testing set:3.6731"
print(paste0('RMSE of the random forest model on Weekend testing set:', round(rf_RMSE_test_wkend, 4)))
## [1] "RMSE of the random forest model on Weekend testing set:3.5549"
print(paste0('RMSE of the lasso regression model on Weekend testing set:', round(lasso_RMSE_wkend, 4)))
## [1] "RMSE of the lasso regression model on Weekend testing set:4.0637"
As with the weekday data, for the weekend data the random forest model performs better than the other models selected. Its RMSE on the training set is 3.5981, which does not suggest a risk of overfitting. However, only about 41% of the variance is explained by the selected random forest model. This is probably again because we have very limited features in our original dataset.
Now that we have the regression model constructed, we would like to further understand the impact of different features.
Weekday
varImpPlot(rf_reg_best_wkday)
The most important factors are the hour when the transaction happened and whether that hour falls within rush hours, which is not surprising. What is interesting is that the weather-related variables follow right after. We use partial dependence plots here to further understand the impact of the weather-related variables.
Weekday
par(mfrow = c(2, 2))
partialPlot(rf_reg_best_wkday, df_weekday[train_wkday, ], x.var = 'wdir')
partialPlot(rf_reg_best_wkday, df_weekday[train_wkday, ], x.var = 'wspd')
partialPlot(rf_reg_best_wkday, df_weekday[train_wkday, ], x.var = 'rhum')
partialPlot(rf_reg_best_wkday, df_weekday[train_wkday, ], x.var = 'temp')
From the partial dependence plots for the weekday data, we can learn that:
Temperature has a U-shaped effect on the number of transactions. Below roughly 2 degrees Celsius, the number of transactions decreases as temperature increases; above roughly 2 degrees Celsius, it increases with temperature. This is potentially because transactions rise during the morning rush hour on weekdays, when temperatures tend to be low.
When the wind direction is between 270 and 350 degrees (westerly to north-westerly winds), sales are higher than when the wind comes from other directions.
Average-level humidity has a positive effect on the number of transactions; however, when humidity is close to or over 90, which suggests rain, there is a sharp drop, showing a negative effect on the number of transactions.
Stronger wind has a negative impact on sales.
To sum up, the most important variable in forecasting the number of transactions at a given time is whether it falls within rush hours. The weather-related variables also impact the number of transactions in different ways, so a practical suggestion for the bakery is to watch the weather forecast and prepare accordingly.
Weekend
varImpPlot(rf_reg_best_wkend)
Similar to the weekday case, the hour when a transaction happened and whether it falls within rush hours are the most important features. The weather-related variables also contribute to the model.
Weekend
par(mfrow = c(2, 2))
partialPlot(rf_reg_best_wkend, df_weekend[train_wkend, ], x.var = 'wdir')
partialPlot(rf_reg_best_wkend, df_weekend[train_wkend, ], x.var = 'wspd')
partialPlot(rf_reg_best_wkend, df_weekend[train_wkend, ], x.var = 'rhum')
partialPlot(rf_reg_best_wkend, df_weekend[train_wkend, ], x.var = 'temp')
From the partial dependence plots for the weekend data, we can learn that:
Interestingly, in contrast to the weekday data, temperature no longer has a U-shaped effect; instead, the number of transactions generally increases with temperature. That is probably because no one needs to go out to buy breakfast on a cold Saturday or Sunday morning.
On weekends, when the wind direction is between 270 and 350 degrees (westerly to north-westerly winds), there is a large increase in the number of transactions.
On weekends, the number of transactions decreases overall as humidity increases, and it drops sharply once humidity exceeds 40. A likely reason for the difference from weekdays is that few people go out when the weather is poor on a weekend.
As on weekdays, wind speed has a negative effect on the number of transactions.
Compared with the weekday case, humidity and temperature impact the number of transactions differently. People have to go to work on weekdays regardless of the weather, but on weekends the weather plays a larger role when people decide whether to visit the bakery.
The objective of this analysis is to identify possible association rules. For example, if we find from the data that people who buy cake tend to buy cookies at the same time, the bakery could consider placing these two items closer together in the shop, or creating bundle sales in order to boost revenue.
We have already re-grouped the items, so for this analysis we will start by looking at the association rules among item types.
transactions_Item_Type <- as(df_merged_mba_item_type$basket, "transactions")
#inspect(transactions_Item_Type[1])
# Set the threshold low to identify more rules
# Set minlen = 2 to make sure there are items in LHS
rules_Item_Type <- apriori(transactions_Item_Type,
parameter = list(support = 0.05,
confidence = 0.025,
minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.025 0.1 1 none FALSE TRUE 5 0.05 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 246
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[9 item(s), 4928 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [27 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Remove redundant rules
rules_Item_Type <- rules_Item_Type[!is.redundant(rules_Item_Type)]
# Reorder by Lift and display
# data.table is required here; attach it if not already loaded
library(data.table)
rules_Item_Type_dt <- data.table( lhs = labels( lhs(rules_Item_Type) ),
rhs = labels( rhs(rules_Item_Type) ),
quality(rules_Item_Type) )[ order(-lift), ]
kable(head(rules_Item_Type_dt, 10))
| lhs | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|
| {Bread} | {Coffee} | 0.2500000 | 0.5116279 | 0.4886364 | 0.8739349 | 1232 |
| {Coffee} | {Bread} | 0.2500000 | 0.4270364 | 0.5854302 | 0.8739349 | 1232 |
| {Cake|Pastry|Sweets} | {Coffee} | 0.2759740 | 0.5087916 | 0.5424107 | 0.8690902 | 1360 |
| {Coffee} | {Cake|Pastry|Sweets} | 0.2759740 | 0.4714038 | 0.5854302 | 0.8690902 | 1360 |
| {Bread} | {Cake|Pastry|Sweets} | 0.2297078 | 0.4700997 | 0.4886364 | 0.8666858 | 1132 |
| {Cake|Pastry|Sweets} | {Bread} | 0.2297078 | 0.4234942 | 0.5424107 | 0.8666858 | 1132 |
| {Meal} | {Coffee} | 0.0951705 | 0.4885417 | 0.1948052 | 0.8345003 | 469 |
| {Coffee} | {Meal} | 0.0951705 | 0.1625650 | 0.5854302 | 0.8345003 | 469 |
| {Hot chocolate|Smoothie|Juice} | {Cake|Pastry|Sweets} | 0.0651380 | 0.4439834 | 0.1467127 | 0.8185373 | 321 |
| {Cake|Pastry|Sweets} | {Hot chocolate|Smoothie|Juice} | 0.0651380 | 0.1200898 | 0.5424107 | 0.8185373 | 321 |
From the table we observe that the highest lift among all rules is still smaller than 1. A lift below 1 means the items co-occur less often than they would under independence, so no valid association rules were identified among the item types.
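As a quick sanity check, this can be confirmed programmatically by filtering the rule set on lift (a sketch using the `arules` `subset` method on the `rules_Item_Type` object above):

```r
# Keep only rules whose lift exceeds 1, i.e. item sets that co-occur
# more often than expected under independence.
rules_lift_gt1 <- subset(rules_Item_Type, lift > 1)
length(rules_lift_gt1)  # expected to be 0 here, given the table above
```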
Then we focus the market basket analysis on the detailed item list instead of the categories. Here we analyze weekdays and weekends separately, as we expect different purchase behaviors.
Weekday
transactions_Item <- as(mba_wkday$basket, "transactions")
#inspect(transactions_Item[1])
# Set the threshold low to identify more rules
# Set minlen = 2 to make sure there are items in LHS
rules_Item <- apriori(transactions_Item,
parameter = list(support = 0.005,
confidence = 0.25,
minlen = 2)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.005 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 59
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[85 item(s), 11825 transaction(s)] done [0.00s].
## sorting and recoding items ... [39 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [115 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Remove redundant rules
rules_Item <- rules_Item[!is.redundant(rules_Item)]
# Reorder by Lift and display
rules_Item_dt_wkday <- data.table( lhs = labels( lhs(rules_Item) ),
rhs = labels( rhs(rules_Item) ),
quality(rules_Item) )[ order(-lift), ]
kable(head(rules_Item_dt_wkday, 20))
| lhs | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|
| {Coffee,Coke} | {Sandwich} | 0.0059197 | 0.4347826 | 0.0136152 | 3.897880 | 70 |
| {Juice,Tea} | {Cookies} | 0.0053277 | 0.3519553 | 0.0151374 | 3.735971 | 63 |
| {Coke} | {Sandwich} | 0.0109937 | 0.3735632 | 0.0294292 | 3.349041 | 130 |
| {Mineral water} | {Sandwich} | 0.0085412 | 0.3519164 | 0.0242706 | 3.154974 | 101 |
| {Coffee,Truffles} | {Sandwich} | 0.0050740 | 0.3260870 | 0.0155603 | 2.923410 | 60 |
| {Coffee,Juice} | {Cookies} | 0.0105708 | 0.2642706 | 0.0400000 | 2.805207 | 125 |
| {Cake,Soup} | {Tea} | 0.0055814 | 0.5689655 | 0.0098097 | 2.794027 | 66 |
| {Coffee,Soup} | {Sandwich} | 0.0085412 | 0.2869318 | 0.0297674 | 2.572380 | 101 |
| {Bread,Soup} | {Tea} | 0.0054123 | 0.3950617 | 0.0136998 | 1.940035 | 64 |
| {Soup,Tea} | {Cake} | 0.0055814 | 0.2738589 | 0.0203805 | 1.935674 | 66 |
| {Coffee,Juice} | {Cake} | 0.0103171 | 0.2579281 | 0.0400000 | 1.823072 | 122 |
| {Spanish Brunch} | {Tea} | 0.0076956 | 0.3625498 | 0.0212262 | 1.780379 | 91 |
| {Chicken Stew} | {Tea} | 0.0078647 | 0.3419118 | 0.0230021 | 1.679031 | 93 |
| {Scone} | {Tea} | 0.0098943 | 0.3371758 | 0.0293446 | 1.655774 | 117 |
| {Soup} | {Tea} | 0.0203805 | 0.3370629 | 0.0604651 | 1.655220 | 241 |
| {Cookies,Juice} | {Tea} | 0.0053277 | 0.3150000 | 0.0169133 | 1.546875 | 63 |
| {Extra Salami or Feta} | {Coffee} | 0.0054968 | 0.8552632 | 0.0064271 | 1.524263 | 65 |
| {Tiffin} | {Tea} | 0.0064271 | 0.3102041 | 0.0207188 | 1.523324 | 76 |
| {Keeping It Local} | {Coffee} | 0.0085412 | 0.8145161 | 0.0104863 | 1.451643 | 101 |
| {Cake} | {Tea} | 0.0401691 | 0.2839211 | 0.1414799 | 1.394255 | 475 |
subrules2 <- head(sort(rules_Item, by = "lift"), 20)
# Note: 'control = list(type = "items")' is no longer a recognized
# parameter in recent arulesViz versions, so we plot the graph directly.
ig <- plot(subrules2, method = "graph")
A few observations can be made from the table:
During weekdays
People who buy drinks, such as coffee, juice, tea, and mineral water, tend to buy a sandwich.
People who buy coffee and juice, or juice and tea, tend to buy cookies. A customer normally would not buy two drinks just for themselves, so we suspect these are purchases of drinks and cookies to share with friends.
People who buy lunch items, such as cake and soup, bread and soup, Spanish brunch, chicken stew, or soup alone, tend to buy tea at the same time.
People who buy sweets, such as scone and cake, tend to buy tea at the same time.
To further explore finding #4, we plot the distribution of transactions including scone or cake by hour.
p1<-mba_wkday%>%filter(Item=='Scone')%>%
group_by(as.factor(Hour))%>%
mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkday$Transaction))) %>%
ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
geom_bar(stat = 'identity',alpha=0.8) +
labs(title = 'Average Number of Transactions of Scone by Hour',
x = 'Hours', y = 'Avg No. of Transaction') +
scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)
p2<-mba_wkday%>%filter(Item=='Cake')%>%
group_by(as.factor(Hour))%>%
mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkday$Transaction))) %>%
ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
geom_bar(stat = 'identity',alpha=0.8) +
labs(title = 'Average Number of Transactions of Cake by Hour',
x = 'Hours', y = 'Avg No. of Transaction') +
scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)
grid.arrange(p1, p2,
nrow=1,ncol=2,
top = textGrob("Average Transaction Frequency by Hours",
gp=gpar(fontsize=14,font=3)))
It turns out that scones and cakes sell more in the afternoon. The association rules between scone, cake, and tea probably reflect afternoon snacking.
Based on the above, for weekdays we could recommend the bakery:
To create bundle sales for drinks and sandwiches
To create bundle sales for multiple drinks and cookies or cakes
To create bundle sales for lunch and tea
To create bundle sales for afternoon snacks and tea
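To turn such bundle recommendations into concrete candidates, one could inspect the weekday rules that point at a given target item (a sketch using the `arules` `subset` method with the `rules_Item` object above; "Sandwich" is just an example right-hand side):

```r
# Show the weekday rules whose right-hand side is Sandwich, sorted by
# lift, to identify which items make sensible bundle partners.
sandwich_rules <- subset(rules_Item, rhs %in% "Sandwich")
inspect(sort(sandwich_rules, by = "lift"))
```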
Weekend
transactions_Item <- as(mba_wkend$basket, "transactions")
#inspect(transactions_Item[1])
# Set the threshold low to identify more rules
# Set minlen = 2 to make sure there are items in LHS
rules_Item_wkend <- apriori(transactions_Item,
parameter = list(support = 0.01,
confidence = 0.25,
minlen = 2)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 70
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[83 item(s), 7047 transaction(s)] done [0.00s].
## sorting and recoding items ... [34 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [84 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Remove redundant rules
rules_Item_wkend <- rules_Item_wkend[!is.redundant(rules_Item_wkend)]
# Reorder by Lift and display
rules_Item_dt_wkend <- data.table( lhs = labels( lhs(rules_Item_wkend) ),
rhs = labels( rhs(rules_Item_wkend) ),
quality(rules_Item_wkend) )[ order(-lift), ]
kable(head(rules_Item_dt_wkend, 10))
| lhs | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|
| {Farm House} | {Medialuna} | 0.0102171 | 0.2599278 | 0.0393075 | 2.351362 | 72 |
| {Jammie Dodgers} | {Cake} | 0.0103590 | 0.3273543 | 0.0316447 | 1.909657 | 73 |
| {Coffee,Tea} | {Cake} | 0.0261104 | 0.2813456 | 0.0928054 | 1.641260 | 184 |
| {Coffee,Hot chocolate} | {Cake} | 0.0184476 | 0.2771855 | 0.0665531 | 1.616992 | 130 |
| {Bread,Tea} | {Cake} | 0.0129133 | 0.2684366 | 0.0481056 | 1.565954 | 91 |
| {Cake} | {Tea} | 0.0488151 | 0.2847682 | 0.1714205 | 1.530711 | 344 |
| {Tea} | {Cake} | 0.0488151 | 0.2623951 | 0.1860366 | 1.530711 | 344 |
| {Hot chocolate} | {Cake} | 0.0278133 | 0.2525773 | 0.1101178 | 1.473437 | 196 |
| {Scone} | {Tea} | 0.0224209 | 0.2585925 | 0.0867036 | 1.390008 | 158 |
| {Hot chocolate,Pastry} | {Coffee} | 0.0100752 | 0.7553191 | 0.0133390 | 1.335692 | 71 |
# Graph for 10 rules - Weekend
subrules2 <- head(sort(rules_Item_wkend, by = "lift"), 10)
# As above, plot the rule graph directly (the old 'control' parameter
# is not recognized by recent arulesViz versions).
ig <- plot(subrules2, method = "graph")
On weekends, the association rules concentrate on a few items: cake, tea, coffee, hot chocolate, and scone. One thing worth noting is that cake sits at the center of the graph; in other words, people tend to buy cake when they are purchasing a number of other items. The rules also suggest that people sometimes buy items to share with others, as there are transactions including both coffee and tea, or coffee and hot chocolate.
Since cake is at the center of the associations, we are curious when cake sales peak and how sales of the different drinks associated with cake vary by hour of the day.
p3<-mba_wkend%>%filter(Item=='Cake')%>%
group_by(as.factor(Hour))%>%
mutate(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkend$Transaction))) %>%
ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
geom_bar(stat = 'identity',alpha=0.8) +
labs(title = 'Average Number of Transactions of Cake by Hour',
x = 'Hours', y = 'Avg No. of Transaction') +
scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)
p4<-mba_wkend%>%filter(Item=='Coffee')%>%
group_by(Hour)%>%
summarise(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkend$Transaction))) %>%
ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
geom_bar(stat = 'identity',alpha=0.8) +
labs(title = 'Average Number of Transactions of Coffee by Hour',
x = 'Hours', y = 'Avg No. of Transaction') +
scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)
p5<-mba_wkend%>%filter(Item=='Tea')%>%
group_by(Hour)%>%
summarise(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkend$Transaction))) %>%
ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
geom_bar(stat = 'identity',alpha=0.8) +
labs(title = 'Average Number of Transactions of Tea by Hour',
x = 'Hours', y = 'Avg No. of Transaction') +
scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)
p6<-mba_wkend%>%filter(Item=='Hot chocolate')%>%
group_by(Hour)%>%
summarise(Avg_No_of_transaction = length(unique(Transaction))/length(unique(mba_wkend$Transaction))) %>%
ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction, fill = as.factor(Hour))) +
geom_bar(stat = 'identity',alpha=0.8) +
labs(title = 'Average Number of Transactions of Hot chocolate by Hour',
x = 'Hours', y = 'Avg No. of Transaction') +
scale_fill_manual(values = colorRampPalette(c('#ecc78d','#2e0a05'))(nb.cols)) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(size=8),
axis.title.x = element_text(size=8),
axis.title.y = element_text(size=8)
)
grid.arrange(p3,p4, p5,p6,
nrow=2,ncol=2,
top = textGrob("Average Transaction Frequency by Hours",
gp=gpar(fontsize=14,font=3)))
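The four plotting blocks above differ only in the item name and fill palette; a small helper function removes the duplication. This is a sketch, assuming the same `mba_wkend` columns (`Item`, `Hour`, `Transaction`) and the `nb.cols` value defined earlier; `summarise()` is used so each hour contributes exactly one bar.

```r
# Helper: share of weekend transactions containing `item`, by hour.
# Assumes mba_wkend (Item, Hour, Transaction) and nb.cols exist, as above.
plot_item_by_hour <- function(item) {
  total_txn <- length(unique(mba_wkend$Transaction))
  mba_wkend %>%
    filter(Item == item) %>%
    group_by(Hour) %>%
    summarise(Avg_No_of_transaction = length(unique(Transaction)) / total_txn) %>%
    ggplot(aes(x = as.factor(Hour), y = Avg_No_of_transaction,
               fill = as.factor(Hour))) +
    geom_bar(stat = 'identity', alpha = 0.8) +
    labs(title = paste('Average Number of Transactions of', item, 'by Hour'),
         x = 'Hours', y = 'Avg No. of Transaction') +
    scale_fill_manual(values = colorRampPalette(c('#ecc78d', '#2e0a05'))(nb.cols)) +
    theme_minimal() +
    theme(legend.position = "none",
          plot.title = element_text(size = 8),
          axis.title.x = element_text(size = 8),
          axis.title.y = element_text(size = 8))
}

plots <- lapply(c('Cake', 'Coffee', 'Tea', 'Hot chocolate'), plot_item_by_hour)
grid.arrange(grobs = plots, nrow = 2, ncol = 2,
             top = textGrob("Average Transaction Frequency by Hours",
                            gp = gpar(fontsize = 14, font = 3)))
```

Any new item can then be profiled with a one-line call instead of another copied block.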
From the figures above, coffee and hot chocolate are sold more in the morning, as opposed to tea. Sales of cake peak in the afternoon. It is likely that on weekends people gather at the bakery in the afternoon to share a cake and several drinks. The bakery could offer a special weekend-afternoon combo of cake and drinks.
Based on these observations, our recommendations to the bakery are:
During weekends, place the cakes in a visible spot close to the cashier
Offer a combo of cake and multiple drinks, especially on weekend afternoons
In this section, we explore whether the transactions form distinct clusters.
Weekday
library(factoextra) # for fviz_nbclust() and fviz_cluster()
cluster_wkday_scale = scale(cluster_wkday)
fviz_nbclust(
cluster_wkday_scale,
kmeans,
k.max = 10,
method = "wss"
) + ggtitle ("Elbow Method")
fviz_nbclust(
cluster_wkday_scale,
kmeans,
k.max = 10,
method = "silhouette"
) + ggtitle ("Silhouette Method")
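Under the hood, `fviz_nbclust(..., method = "wss")` is essentially a loop over k that records the total within-cluster sum of squares. A self-contained sketch of the same computation, illustrated on `scale(mtcars)` as stand-in data:

```r
# Elbow method by hand: total within-cluster SS for k = 1..10.
set.seed(0)
dat <- scale(mtcars)  # stand-in for cluster_wkday_scale
wss <- sapply(1:10, function(k) {
  kmeans(dat, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares", main = "Elbow Method")
```

The "elbow" is the k after which adding clusters yields only marginal reductions in `wss`.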
Both figures suggest that the best K should be 6. However, after trying several different K, we decide to go with K = 4.
set.seed(0)
km_out_2 <- kmeans(cluster_wkday_scale, 4, nstart = 25)
fviz_cluster(km_out_2, cluster_wkday_scale, geom = 'point', ellipse.type = 'norm',
main = 'Weekday Dataset: K-Means Clustering Results with K=4') +
scale_fill_manual(values = cookiecol[c(7,5,1,4)]) +
scale_color_manual(values = cookiecol[c(7,5,1,4)]) +
theme_minimal()
Using PCA, we are able to visualize the clusters in a 2-D plane. Then, to understand how well we split the transactions into four clusters, we compare the centers of the clusters.
library('DMwR')
unscale(km_out_2$centers, cluster_wkday_scale)
## Rush_hours Holiday Item_Type temp rhum wdir wspd coco
## 1 1.000000 1 3.926362 7.379971 77.18851 216.1267 16.18395 1.319588
## 2 1.981895 1 4.452211 9.124211 73.58316 236.4000 21.98800 1.267368
## 3 1.869395 1 3.942265 4.073991 87.50224 177.8363 10.30409 1.168722
## 4 1.882759 2 4.400000 8.579310 79.96552 234.8276 22.35103 1.027586
## Hour
## 1 16.45066
## 2 12.54063
## 3 10.92265
## 4 12.73103
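Note that DMwR has been archived on CRAN, so installing it may fail. The same back-transformation can be reproduced from the attributes `scale()` attaches to its result; a self-contained sketch, illustrated on `mtcars` as stand-in data:

```r
# Back-transform k-means centers to original units without DMwR.
# scale() stores each column's mean and sd as attributes of its output.
unscale_centers <- function(centers, scaled) {
  mu  <- attr(scaled, "scaled:center")
  sdv <- attr(scaled, "scaled:scale")
  sweep(sweep(centers, 2, sdv, `*`), 2, mu, `+`)
}

s  <- scale(mtcars)
km <- kmeans(s, 3, nstart = 25)
unscale_centers(km$centers, s)  # centers expressed in the original units
```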
In general, one cluster contains only the non-rush-hour transactions and another contains the holiday transactions; the remaining two clusters seem to be split based on weather conditions, including humidity, wind direction, and wind speed.
cluster_wkday_cbind<-cbind(km_out_2$cluster, cluster_wkday)
cluster_wkday1<-cluster_wkday_cbind%>%filter(km_out_2$cluster==1)
cluster_wkday2<-cluster_wkday_cbind%>%filter(km_out_2$cluster==2)
cluster_wkday3<-cluster_wkday_cbind%>%filter(km_out_2$cluster==3)
cluster_wkday4<-cluster_wkday_cbind%>%filter(km_out_2$cluster==4)
names(cluster_wkday_cbind)[1]<-'cluster_no'
cluster_wkday_cbind$cluster_no<-as.factor(cluster_wkday_cbind$cluster_no)
cluster_wkday_cbind %>%
ggplot( aes(x=as.factor(Item_Type),fill=cluster_no)) + facet_wrap(~ cluster_no) +
geom_bar(stat = 'count',alpha=0.6, position = 'identity') +
scale_fill_brewer(palette = 'BrBG') +
theme_minimal()
From the graph above, cluster 2 has the highest number of transactions while cluster 4 has the lowest, which is explained by cluster 4 holding the holiday data. The numbers of transactions in clusters 1, 2, and 3, which all contain non-holiday data, are still quite different. Let’s look at the details of the other features in these clusters.
p1<-cluster_wkday1 %>%
ggplot(aes(x=as.factor(Item_Type))) +
geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[1]) +
theme_minimal()
p2<-cluster_wkday1 %>%
ggplot(aes(x=as.factor(Rush_hours))) +
geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[1]) +
theme_minimal()
p3<-cluster_wkday1 %>%
ggplot(aes(x=as.factor(Holiday))) +
geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[1]) +
theme_minimal()
p4<-cluster_wkday1 %>%
ggplot(aes(x=temp)) +
geom_density(color=cookiecol[1], fill=cookiecol[1], alpha=0.6, position = 'identity') +
theme_minimal()
p5<-cluster_wkday1 %>%
ggplot(aes(x=wspd)) +
geom_density(color=cookiecol[1], fill=cookiecol[1], alpha=0.6, position = 'identity') +
theme_minimal()
p6<-cluster_wkday1 %>%
ggplot(aes(x=rhum)) +
geom_density(color=cookiecol[1], fill=cookiecol[1], alpha=0.6, position = 'identity') +
theme_minimal()
grid.arrange(p1,p2,p3,p4,p5,p6, nrow=3,ncol=2)
Cluster 1 contains non-rush-hour, non-holiday transactions, with a mean temperature of around 7, relatively low wind speed, high humidity, and a low number of transactions across item types.
p1<-cluster_wkday2 %>%
ggplot(aes(x=as.factor(Item_Type))) +
geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[5]) +
theme_minimal()
p2<-cluster_wkday2 %>%
ggplot(aes(x=as.factor(Rush_hours))) +
geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[5]) +
theme_minimal()
p3<-cluster_wkday2 %>%
ggplot(aes(x=as.factor(Holiday))) +
geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[5]) +
theme_minimal()
p4<-cluster_wkday2 %>%
ggplot(aes(x=temp)) +
geom_density(color=cookiecol[5], fill=cookiecol[5], alpha=0.6, position = 'identity') +
theme_minimal()
p5<-cluster_wkday2 %>%
ggplot(aes(x=wspd)) +
geom_density(color=cookiecol[5], fill=cookiecol[5], alpha=0.6, position = 'identity') +
theme_minimal()
p6<-cluster_wkday2 %>%
ggplot(aes(x=rhum)) +
geom_density(color=cookiecol[5], fill=cookiecol[5], alpha=0.6, position = 'identity') +
theme_minimal()
p7<-cluster_wkday2 %>%
ggplot(aes(x=Hour)) +
geom_density(color=cookiecol[5], fill=cookiecol[5], alpha=0.6, position = 'identity') +
theme_minimal()
grid.arrange(p1,p2,p3,p4,p5,p6,p7, nrow=4,ncol=2)
Cluster 2 contains rush-hour, non-holiday transactions, with relatively high temperature, relatively high wind speed, relatively low humidity, and a relatively high number of transactions across item types.
p1<-cluster_wkday3 %>%
ggplot(aes(x=as.factor(Item_Type))) +
geom_bar(stat = 'count',alpha=0.8, position = 'identity', fill=cookiecol[7]) +
theme_minimal()
p2<-cluster_wkday3 %>%
ggplot(aes(x=as.factor(Rush_hours))) +
geom_bar(stat = 'count',alpha=0.8, position = 'identity', fill=cookiecol[7]) +
theme_minimal()
p3<-cluster_wkday3 %>%
ggplot(aes(x=as.factor(Holiday))) +
geom_bar(stat = 'count',alpha=0.8, position = 'identity', fill=cookiecol[7]) +
theme_minimal()
p4<-cluster_wkday3 %>%
ggplot(aes(x=temp)) +
geom_density(color=cookiecol[7], fill=cookiecol[7], alpha=0.8, position = 'identity') +
theme_minimal()
p5<-cluster_wkday3 %>%
ggplot(aes(x=wspd)) +
geom_density(color=cookiecol[7], fill=cookiecol[7], alpha=0.8, position = 'identity') +
theme_minimal()
p6<-cluster_wkday3 %>%
ggplot(aes(x=rhum)) +
geom_density(color=cookiecol[7], fill=cookiecol[7], alpha=0.8, position = 'identity') +
theme_minimal()
grid.arrange(p1,p2,p3,p4,p5,p6, nrow=3,ncol=2)
Cluster 3 represents (mostly) rush hours on non-holidays, with a mean temperature close to 4, relatively low wind speed, and relatively high humidity.
p1<-cluster_wkday4 %>%
ggplot(aes(x=as.factor(Item_Type))) +
geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[8]) +
theme_minimal()
p2<-cluster_wkday4 %>%
ggplot(aes(x=as.factor(Rush_hours))) +
geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[8]) +
theme_minimal()
p3<-cluster_wkday4 %>%
ggplot(aes(x=as.factor(Holiday))) +
geom_bar(stat = 'count',alpha=0.6, position = 'identity', fill=cookiecol[8]) +
theme_minimal()
p4<-cluster_wkday4 %>%
ggplot(aes(x=temp)) +
geom_density(color=cookiecol[8], fill=cookiecol[8], alpha=0.6, position = 'identity') +
theme_minimal()
p5<-cluster_wkday4 %>%
ggplot(aes(x=wspd)) +
geom_density(color=cookiecol[8], fill=cookiecol[8], alpha=0.6, position = 'identity') +
theme_minimal()
p6<-cluster_wkday4 %>%
ggplot(aes(x=rhum)) +
geom_density(color=cookiecol[8], fill=cookiecol[8], alpha=0.6, position = 'identity') +
theme_minimal()
grid.arrange(p1,p2,p3,p4,p5,p6, nrow=3,ncol=2)
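As with the hourly plots earlier, the four near-identical per-cluster profile blocks can be collapsed into one function. This is a sketch, assuming the `cluster_wkday1`–`cluster_wkday4` subsets and the `cookiecol` palette defined above:

```r
# Profile one cluster: bar charts for the categorical features and
# density plots for the continuous ones, in a single 3x2 grid.
profile_cluster <- function(df, col) {
  bar <- function(var) ggplot(df, aes(x = as.factor(.data[[var]]))) +
    geom_bar(alpha = 0.6, fill = col) + xlab(var) + theme_minimal()
  dens <- function(var) ggplot(df, aes(x = .data[[var]])) +
    geom_density(color = col, fill = col, alpha = 0.6) + theme_minimal()
  grid.arrange(bar('Item_Type'), bar('Rush_hours'), bar('Holiday'),
               dens('temp'), dens('wspd'), dens('rhum'),
               nrow = 3, ncol = 2)
}

profile_cluster(cluster_wkday1, cookiecol[1])
```

Each remaining cluster is then one call with its data frame and color.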
Cluster 4 contains the holiday data, which shows an even lower number of transactions than cluster 1 (non-rush hours on non-holidays), even though the temperature is relatively higher and the humidity relatively lower.
In conclusion:
Since all clusters except cluster 1 overlap substantially, we can obtain only limited information from the clustering results for now. We can mainly say that a lower number of transactions tends to occur on holidays as opposed to non-holidays.
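The overlap can be quantified with average silhouette widths per cluster (values near 0 indicate points sitting on cluster boundaries). A sketch using the `cluster` package, assuming the `km_out_2` fit and `cluster_wkday_scale` matrix from above:

```r
library(cluster)  # silhouette()

# Silhouette width of each point, given the cluster assignments
# and pairwise distances on the scaled weekday data.
sil <- silhouette(km_out_2$cluster, dist(cluster_wkday_scale))
summary(sil)$clus.avg.widths  # per-cluster average silhouette width
```

Clusters whose average width is close to 0 are the ones that blur into their neighbours in the PCA plot.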
Weekend
cluster_wkend_scale = scale(cluster_wkend)
fviz_nbclust(
cluster_wkend_scale,
kmeans,
k.max = 10,
method = "wss"
) + ggtitle ("Elbow Method")
fviz_nbclust(
cluster_wkend_scale,
kmeans,
k.max = 10,
method = "silhouette"
) + ggtitle ("Silhouette Method")
Both figures suggest that the best K should be 8. However, after trying several different K, we select K = 3.
set.seed(0)
km_out_2 <- kmeans(cluster_wkend_scale, 3, nstart = 25)
fviz_cluster(km_out_2, cluster_wkend_scale, geom = 'point', ellipse.type = 'norm',
main = 'Weekend Dataset: K-Means Clustering Results with K=3') +
scale_fill_manual(values = cookiecol[c(5,1,7)]) +
scale_color_manual(values = cookiecol[c(5,1,7)]) +
theme_minimal()
Then, to understand how well we split the transactions into three clusters, we compare the centers of the clusters.
unscale(km_out_2$centers, cluster_wkend_scale)
## Rush_hours Holiday Item_Type temp rhum wdir wspd coco
## 1 1.911582 1 4.379826 4.418431 89.62391 161.5068 13.04633 1.724782
## 2 1.815981 1 4.641646 9.690880 75.37772 249.4592 17.01324 1.045198
## 3 1.888889 2 4.311111 7.911111 78.13333 245.0000 36.68000 1.688889
## Hour
## 1 11.59651
## 2 13.07345
## 3 12.06667
cluster_wkend_cbind<-cbind(km_out_2$cluster, cluster_wkend)
cluster_wkend1<-cluster_wkend_cbind%>%filter(km_out_2$cluster==1)
cluster_wkend2<-cluster_wkend_cbind%>%filter(km_out_2$cluster==2)
cluster_wkend3<-cluster_wkend_cbind%>%filter(km_out_2$cluster==3)
names(cluster_wkend_cbind)[1]<-'cluster_no'
cluster_wkend_cbind$cluster_no<-as.factor(cluster_wkend_cbind$cluster_no)
cluster_wkend_cbind %>%
ggplot( aes(x=as.factor(Item_Type),fill=cluster_no)) + facet_wrap(~ cluster_no) +
geom_bar(stat = 'count',alpha=0.8, position = 'identity') +
scale_fill_manual(values = cookiecol[c(5,1,7)]) +
xlab('Item Type') +
ylab('Number of Transactions') +
theme_minimal()
From the graph above, cluster 2 has the highest number of transactions while cluster 3 has the lowest, which is explained by the holiday data in cluster 3. The numbers of transactions of clusters 1 and 2, which both contain non-holiday data, are still quite different. Let’s look at the details of the other features in these two clusters.
p1<-cluster_wkend_cbind %>%
filter(cluster_no%in%c(1,2)) %>%
ggplot(aes(x=Hour, fill=cluster_no)) +
geom_density( color="#e9ecef", alpha=0.8, position = 'identity') +
scale_fill_manual(values = cookiecol[c(5,1)]) +
theme_minimal() +
labs(fill="")
p2<-cluster_wkend_cbind %>%
filter(cluster_no%in%c(1,2)) %>%
ggplot(aes(x=temp, fill=cluster_no)) +
geom_density( color="#e9ecef", alpha=0.8, position = 'identity') +
scale_fill_manual(values = cookiecol[c(5,1)]) +
theme_minimal() +
labs(fill="") +
xlab('Temperature')
p3<-cluster_wkend_cbind %>%
filter(cluster_no%in%c(1,2)) %>%
ggplot(aes(x=wspd, fill=cluster_no)) +
geom_density( color="#e9ecef", alpha=0.8, position = 'identity') +
scale_fill_manual(values = cookiecol[c(5,1)]) +
theme_minimal() +
labs(fill="") +
xlab('Wind Speed')
p4<-cluster_wkend_cbind %>%
filter(cluster_no%in%c(1,2)) %>%
ggplot(aes(x=rhum, fill=cluster_no)) +
geom_density( color="#e9ecef", alpha=0.8, position = 'identity') +
scale_fill_manual(values = cookiecol[c(5,1)]) +
theme_minimal() +
labs(fill="") +
xlab('Humidity')
grid.arrange(p1,p2,p3,p4,nrow=2,ncol=2)
Cluster 2 tends to contain afternoon data, with relatively higher temperature, higher wind speed, and relatively lower humidity. However, there is still a lot of overlap between the two clusters, so only limited information can be derived from the clustering results for now.
In a nutshell, with the bakery transaction data and weather records, we first separated weekdays from weekends. Then we explored how different features influence the number of transactions, especially the weather-related variables. After that, we dove into the association rules in the transaction history, identified purchasing patterns, and provided recommendations for the bakery to further increase sales. In the end, we tried to cluster the transactions into different segments and discussed the initial results.
The major next steps of this analysis are:
When predicting the number of transactions, the selected random forest model could only explain 30%-40% of the variance. We will need to either acquire more data beyond the Oct 2016 - Apr 2017 window or engineer new features. Our current features only include date, time, item names, and weather. If we could obtain information about the customers, we could further improve the model's predictive performance.
The association rules we identified have low support in general. With more data, we might be able to further explore the possible rules.
We could dive deeper into the clustering analysis to further explore the similarities of transactions within the same cluster. If we could acquire more features, such as customer demographics, we could extend the clustering analysis and eventually turn the findings into actionable recommendations for the bakery.
Whether a day is a public holiday is not significant in any of our analyses. Part of the reason is that our dataset spans only six months. If we could acquire data covering several years, we could further explore the impact of holidays on the bakery's sales.